Installation

To use Weave’s predefined scorers you need to install some additional dependencies:
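A typical install, assuming the optional scorers extra bundles the required dependencies:

```
pip install weave[scorers]
```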
LLM-evaluators

Update Feb 2025: The pre-defined scorers that leverage LLMs now automatically integrate with litellm. You no longer need to pass an LLM client; just set the model_id. See the supported models here.

These scorers accept litellm model IDs (e.g. openai/gpt-4o, openai/text-embedding-3-small). If you wish to experiment with other providers, you can simply update the model_id. For example, to use an Anthropic model:
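A sketch using one of the scorers described below; the exact model string is illustrative, and any Anthropic ID that litellm supports will do:

```python
from weave.scorers import HallucinationFreeScorer

# Illustrative model string; substitute any litellm-supported Anthropic model ID.
scorer = HallucinationFreeScorer(model_id="anthropic/claude-3-5-sonnet-20240620")
```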
HallucinationFreeScorer
This scorer checks if your AI system’s output includes any hallucinations based on the input data.

Customization:
- Customize the system_prompt and user_prompt fields of the scorer to define what “hallucination” means for you.

Notes:
- The score method expects an input column named context. If your dataset uses a different name, use the column_map attribute to map context to the dataset column.
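A minimal sketch of wiring the scorer into an evaluation (the project name, dataset, and model function are illustrative stand-ins):

```python
import asyncio
import weave
from weave.scorers import HallucinationFreeScorer

weave.init("hallucination-demo")  # illustrative project name

@weave.op()
def my_model(context: str) -> str:
    # Stand-in for your real AI system.
    return "Paris is the capital of France."

evaluation = weave.Evaluation(
    dataset=[{"context": "France's capital city is Paris."}],
    scorers=[HallucinationFreeScorer()],  # reads the "context" column from the dataset
)
asyncio.run(evaluation.evaluate(my_model))
```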
SummarizationScorer
Use an LLM to compare a summary to the original text and evaluate the quality of the summary.

How It Works: This scorer evaluates summaries in two ways:
- Entity Density: Checks the ratio of unique entities (like names, places, or things) mentioned in the summary to the total word count in the summary in order to estimate the “information density” of the summary. Uses an LLM to extract the entities. Similar to how entity density is used in the Chain of Density paper, https://arxiv.org/abs/2309.04269
- Quality Grading: An LLM evaluator grades the summary as poor, ok, or excellent. These grades are then mapped to scores (0.0 for poor, 0.5 for ok, and 1.0 for excellent) for aggregate performance evaluation.

Customization:
- Adjust summarization_evaluation_system_prompt and summarization_evaluation_prompt to tailor the evaluation process.

Notes:
- The scorer uses litellm internally.
- The score method expects the original text (the one being summarized) to be present in the input column. Use column_map if your dataset uses a different name.
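A short sketch of customizing the scorer (constructor field names follow the options listed above; the model ID is illustrative):

```python
from weave.scorers import SummarizationScorer

summarization_scorer = SummarizationScorer(
    model_id="openai/gpt-4o",  # any litellm-supported model ID
)
# Wire into weave.Evaluation as in the HallucinationFreeScorer example above;
# the text being summarized must be in the "input" column (or mapped via column_map).
```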
OpenAIModerationScorer
The OpenAIModerationScorer uses OpenAI’s Moderation API to check if the AI system’s output contains disallowed content, such as hate speech or explicit material.

How It Works:
- Sends the AI’s output to the OpenAI Moderation endpoint and returns a structured response indicating if the content is flagged.
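A minimal sketch (the scorer calls OpenAI’s Moderation endpoint, so an OpenAI API key needs to be configured in your environment):

```python
from weave.scorers import OpenAIModerationScorer

moderation_scorer = OpenAIModerationScorer()
# Add to a weave.Evaluation's scorers list to flag disallowed content in model outputs.
```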
EmbeddingSimilarityScorer
The EmbeddingSimilarityScorer computes the cosine similarity between the embeddings of the AI system’s output and a target text from your dataset. It is useful for measuring how similar the AI’s output is to a reference text.

Note: You can use column_map to map the target column to a different name.

Parameters:
- threshold (float): The minimum cosine similarity score (between -1 and 1) needed to consider the two texts similar (defaults to 0.5).
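A sketch showing the threshold parameter and a column_map for a dataset whose reference column is not named target (the column name is illustrative):

```python
from weave.scorers import EmbeddingSimilarityScorer

similarity_scorer = EmbeddingSimilarityScorer(
    threshold=0.7,                              # stricter than the 0.5 default
    column_map={"target": "reference_answer"},  # dataset column holding the reference text
)
```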
ValidJSONScorer
The ValidJSONScorer checks whether the AI system’s output is valid JSON. This scorer is useful when you expect the output to be in JSON format and need to verify its validity.
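A minimal sketch (the model function and dataset are illustrative):

```python
import asyncio
import weave
from weave.scorers import ValidJSONScorer

weave.init("json-demo")  # illustrative project name

@weave.op()
def json_model(question: str) -> str:
    return '{"answer": 42}'  # stand-in for a model expected to emit JSON

evaluation = weave.Evaluation(
    dataset=[{"question": "What is six times seven?"}],
    scorers=[ValidJSONScorer()],
)
asyncio.run(evaluation.evaluate(json_model))
```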
ValidXMLScorer

The ValidXMLScorer checks whether the AI system’s output is valid XML. It is useful when expecting XML-formatted outputs.
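Usage mirrors the JSON scorer; a minimal sketch:

```python
from weave.scorers import ValidXMLScorer

xml_scorer = ValidXMLScorer()
# Add to a weave.Evaluation's scorers list when your model is expected to emit XML.
```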
PydanticScorer

The PydanticScorer validates the AI system’s output against a Pydantic model to ensure it adheres to a specified schema or data structure.
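A sketch, assuming the Pydantic class is passed to the scorer via a model field (the field name and schema here are illustrative):

```python
from pydantic import BaseModel
from weave.scorers import PydanticScorer

class Person(BaseModel):
    name: str
    age: int

# Outputs are scored on whether they parse into the Person schema.
person_scorer = PydanticScorer(model=Person)
```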
RAGAS - ContextEntityRecallScorer

The ContextEntityRecallScorer estimates context recall by extracting entities from both the AI system’s output and the provided context, then computing the recall score. It is based on the RAGAS evaluation library.

How It Works:
- Uses an LLM to extract unique entities from the output and context and calculates recall.
- Recall indicates the proportion of important entities from the context that are captured in the output.
- Returns a dictionary with the recall score.

Notes:
- Expects a context column in your dataset. Use the column_map attribute if the column name is different.
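A minimal sketch (the dataset column name is illustrative; the mapping is only needed if your column is not called context):

```python
from weave.scorers import ContextEntityRecallScorer

entity_recall_scorer = ContextEntityRecallScorer(
    column_map={"context": "retrieved_documents"},  # map your column to "context"
)
```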
RAGAS - ContextRelevancyScorer
The ContextRelevancyScorer evaluates the relevancy of the provided context to the AI system’s output. It is based on the RAGAS evaluation library.

How It Works:
- Uses an LLM to rate the relevancy of the context to the output on a scale from 0 to 1.
- Returns a dictionary with the relevancy_score.

Notes:
- Expects a context column in your dataset. Use column_map if a different name is used.

Customization:
- Customize the relevancy_prompt to define how relevancy is assessed.
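A sketch overriding the relevancy_prompt (the prompt text is illustrative; mirror the structure of the scorer’s default prompt if you need template placeholders):

```python
from weave.scorers import ContextRelevancyScorer

relevancy_scorer = ContextRelevancyScorer(
    # Illustrative prompt; tailor it to your own definition of relevancy.
    relevancy_prompt="Rate how relevant the provided context is to the model output on a scale from 0 to 1.",
)
```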